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Abstract 

Materialized views and indexes are physical structures for accelerating data access that are 

I— S 

I casually used in data warehouses. However, these data structures generate some maintenance 

overhead. They also share the same storage space. Most existing studies about materialized 
view and index selection consider these structures separately. In this paper, we adopt the 
opposite stance and couple materialized view and index selection to take view-index interac- 
ts tions into account and achieve efficient storage space sharing. Candidate materialized views 

o 

and indexes are selected through a data mining process. We also exploit cost models that 
evaluate the respective benefit of indexing and view materialization, and help select a relevant 

> 

configuration of indexes and materialized views among the candidates. Experimental results 
show that our strategy performs better than an independent selection of materialized views 
and indexes. 

o 
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1 Introduction 

& 

Large-scale usage of databases in general and data warehouses in particular requires an admin- 
istrator whose principal role is data management, both at the logical level (schema definition) 
and physical level (files and disk storage), as well as performance optimization. With the wide 
development of Database Management Systems (DBMSs), minimizing the administration func- 
tion has become crucial (Chaudhuri & Narasayya, 1997). One important administration task 
is the selection of suitable physical structures to improve system performance by minimizing 
data access time (Finkelstein, Schkolnick, & Tiberio, 1988). 

Among techniques adopted in data warehouse relational implementations for improving 
query performance, view materialization and indexing are presumably the most effective (Rizzi 
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Sz Saltarelli, 2003). Materialized views are physical structures that improve data access time 
by precomputing intermediary results. Therefore, end-user queries can be efficiently processed 
through data stored in views and do not need to access the original data. Indexes are also 
physical structures that allow direct data access. They avoid sequential scans and thereby 
reduce query response time. Nevertheless, exploiting either materialized views or indexes 
requires additional storage space and entails maintenance overhead when refreshing the data 
warehouse. The issue is thus to select an appropriate configuration (set) of materialized views 
and indexes that minimizes query response time and the selected data structures' maintenance 
cost, given a limited storage space. 

The literature regarding materialized view and index selection in relational databases and 
data warehouses is quite abundant. However, we have identified two key issues requiring 
enhancements. First, the actual selection of suitable candidate materialized views and indexes 
is rarely addressed in existing approaches. Most of them indeed present scaling problems at 
this level. Second, none of these approaches takes into account the interactions that may exist 
between materialized views, between indexes, and between indexes and materialized views 
(including the approaches that simultaneously select both materialized views and indexes). 

In this paper, we present a novel strategy for optimizing data warehouse performance 
that aims at addressing both these issues. We have indeed designed a generic approach whose 
objective is to automatically propose solutions to data warehouse administrators for optimizing 
data access time. The principle of this approach is to apply data mining techniques on a 
workload (set of queries) that is representative of data warehouse usage in order to deduce 
a quasi-optimal configuration of materialized views and/or indexes. Data mining actually 
helps reduce the selection problem's complexity and improves scalability. Then, cost models 
help select among the selected materialized views and indexes the most efficient in terms of 
performance gain/overhead ratio. We have applied our approach on three related problems: 
isolate materialized view selection, isolate index selection and joint materialized view and index 
selection. In the last case, we included index-view interactions in our cost models. 

The remainder of this paper is organized as follows. Section 2 presents and discusses 
the state of the art regarding materialized view and index selection. Section 3 motivates and 
presents the principle of our performance optimization approach. Section 4 further details how 
we apply this approach to isolate materialized view selection, isolate index selection and joint 
materialized view and index selection, respectively. We particularly focus on joint materialized 
view and index selection, which is our latest development. Section 5 presents the experimental 
results we achieved to illustrate our approach's relevance. Finally, we conclude this paper and 
provide research perspectives in Section 6. 
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2 Related work 



In this section, we first formalize the materialized view and index selection problem, and then 
detail and discuss the state of the art regarding materialized view selection, index selection 
and joint index and materialized view selection, respectively. 

2.1 Materialized view and index selection: formal problem def- 
inition 

The materialized view and index selection problem consists in building a set of materialized 
views and indexes that optimizes the execution cost of a given workload. This optimization may 
be realized under constraints, typically the storage space available for storing these physical 
data structures. 

Let Vc and Iq be two sets of materialized views and indexes, respectively, that are termed 
candidate and are susceptible to reduce the execution cost of a given query set Q (generally 
supposed representative of system workload). Let Oc — Vc U Ic ■ Let S be the storage space 
allotted by the data warehouse administrator to build objects (materialized views or indexes) 
from set Oc- The joint materialized view and index selection problem consists in building 
an object configuration O C Oc that minimizes the execution cost of Q, under storage space 
constraint. This NP-hard problem (Comer, 1978; Gupta, 1999) may be formalized as follows: 

• cost(Q,0) = rnin(cost(Q,$)) V C O c ; 

• ^^size(o) < S, where size(o) is the disk space occupied by object o. 

oeo 

2.2 Materialized view selection 

The materialized view selection problem has received significant attention in the literature. 
Related researches differ in several points: 

1. the way the set of candidate views Vc is determined; 

2. the framework used to capture relationships between candidate views; 

3. the use of mathematical cost models vs. calls to the system's query optimizer; 

4. view selection in the relational or multidimensional context; 

5. multiple or simple query optimization; 

6. theoretical or technical solutions. 

Classical papers in materialized view selection introduce a lattice framework that models 
and captures dependency (ancestor or descendent) among aggregate views in a multidimen- 
sional context (Harinarayan, Rajaraman, & Ullman, 1996; Baralis, Paraboschi, & Teniente, 
1997; Kotidis & Roussopoulos, 1999; Uchiyama, Runapongsa, & Teorey, 1999). This lattice 
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is greedily browsed with the help of cost models to select the best views to materialize. This 
problem has first been addressed in one data cube and then extended to multiple cubes (Shukla, 
Deshpande, & Naughton, 2000). Another theoretical framework, the AND-OR view graph, 
may also be used to capture the relationships between views (Chan, Li, & Feng, 1999; Nadeau 
& Teorey, 2002; Valluri, Vadapalli, & Karlapalem, 2002; Gupta & Mumick, 2005). Unfortu- 
nately, the majority of these solutions are theoretical and are not truly scalable. 

A wavelet framework for adaptively representing multidimensional data cubes has also been 
proposed (Smith, Li, & Jhingran, 2004). This method decomposes data cubes into an indexed 
hierarchy of wavelet view elements that correspond to partial and residual aggregations of 
data cubes. An algorithm greedily selects a non-expensive set of wavelet view elements that 
minimizes the average processing cost of data cube queries. In the same spirit, Sismanis, Deli- 
giannakis, Roussopoulos, and Kotidis (2002) proposed the Dwarf structure, which compresses 
data cubes. Dwarf identifies prefix and suffix redundancies within cube cells and factors them 
out by coalescing their storage. Suppressing redundancy improves the maintenance and in- 
terrogation costs of data cubes. These approaches are very interesting, but they are mainly 
focused on computing efficient data cubes by changing their physical design. 

Other approaches detect common sub-expressions within workload queries in the relational 
context (Goldstein & Ake Larson, 2001; Baril & Bellahsene, 2003; Rizzi & Saltarelli, 2003). 
The view selection problem then consists in finding common subexpressions corresponding to 
intermediary results that are suitable to materialize. However, browsing is very costly and 
these methods are not truly scalable with respect to the number of queries. 

Finally, the most recent approaches are workload-driven. They syntactically analyze a 
workload to enumerate relevant candidate views (Agrawal, Chaudhuri, & Narasayya, 2000). 
By exploiting the system's query optimizer, they greedily build a configuration of the most 
pertinent views. A workload is indeed a good starting point to predict future queries because 
these queries are probably within or syntactically close to a previous query workload. In 
addition, extracting candidate views from the workload ensures that future materialized views 
will probably be used when processing queries. 

2.3 Index selection 

The index selection problem has been studied for many years in databases (Finkelstein et al., 
1988; Frank, Omiecinski, & Navathe, 1992; Agrawal et al., 2000; Valentin, Zuliani, Zilio, 
Lohman, & Skelley, 2000; Feldman & Reouven, 2003; Kratica, Ljubic, & Tosic, 2003; Chaud- 
huri, Datar, & Narasayya, 2004). In the more specific context of data warehouses, exist- 
ing research studies may be clustered into two families: algorithms that optimize mainte- 
nance cost (Labio, Quass, & Adelberg, 1997) and algorithms that optimize query response 
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time (Gupta, Harinarayan, Rajaraman, & Ullman, 1997; Agrawal, Chaudhuri, & Narasayya, 
2001; Golfarelli, Rizzi, & Saltarelli, 2002). In both cases, optimization is realized under storage 
space constraint. In this paper, we focus on the second family of solutions, which is relevant 
in our context. Studies falling in this category may be further categorized depending on how 
the set of candidate indexes Ic and the final configuration of indexes I are built. 

Selecting a set of candidate indexes may be automatic or manual. Warehouse administra- 
tors may indeed appeal to their expertise and manually provide, from a given workload, a set 
of candidate indexes (Frank et al., 1992; Choenni, Blanken, & Chang, 1993a, 1993b). Such a 
choice is however subjective. Moreover, the task may be very hard to achieve when the number 
of queries is very high. In opposition, candidate indexes can also be extracted automatically, 
through a syntactic analysis of queries (Chaudhuri & Narasayya, 1997; Valentin et al., 2000; 
Golfarelli et al., 2002). Such an analysis depends on the DBMS, since each DBMS is queried 
through a specific syntax derived from the SQL standard. 

The methods for building a final index configuration from candidate indexes may be cate- 
gorized into: 

1. ascending or descending greedy methods; 

2. methods derived from genetic algorithms; 

3. methods assimilating the selection problem to the well-known knapsack optimization 
problem. 

Ascending greedy methods start from an empty set of candidate indexes (Kyu- Young, 1987; 
Frank et al., 1992; Choenni et al., 1993b; Chaudhuri & Narasayya, 1997). They incrementally 
add in indexes minimizing cost. This process stops when cost ceases decreasing. Contrarily, 
descending greedy methods consider the whole set of candidate indexes as a starting point. 
Then, at each iteration, indexes are pruned (Kyu- Young, 1987; Choenni et al., 1993a). If 
workload cost before pruning is lower (respectively, greater) than workload cost after pruning, 
the pruned indexes are useless (respectively, useful) for reducing cost. The pruning process 
stops when cost increases after pruning. 

Genetic algorithms are commonly used to resolve optimization problems. They have been 
adapted to the index selection problem (Kratica et al., 2003). The initial population is a set 
of input indexes (an index is assimilated to an individual). The objective function to optimize 
is the workload cost corresponding to an index configuration. The combinatory construction 
of an index configuration is realized through the crossover, mutation and selection genetic 
operators. Eventually, the index selection problem has also been formulated in several studies 
as a knapsack problem (Ip, Saxton, & Raghavan, 1983; Gundem, 1999; Valentin et al., 2000; 
Feldman & Reouven, 2003) where indexes are objects, index storage costs represent object 
weights, workload cost is the benefit function, and storage space is knapsack size. 
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2.4 Joint materialized view and index selection 

Few research studies deal with simultaneous index and materialized view selection. Agrawal 
et al. (2000) have proposed three strategies. The first one, MVFIRST, selects materialized 
views first, and then indexes, taking the presence of selected views into account. The second 
alternative, INDFIRST, selects indexes first, and then materialized views. The third alterna- 
tive, joint enumeration, processes indexes, materialized views and indexes over these views at 
the same time. According to the authors, this approach is more efficient than MVFIRST and 
INDFIRST, but no further details are provided. 

Bellatreche, Karlapalem, and Schneider (2000) studied storage space distribution among 
materialized views and indexes. First, a set of materialized views and indexes is designed as 
an initial solution. Then, the approach iteratively reconsiders the solution to further reduce 
execution cost, by redistributing storage space between indexes and materialized views. Two 
agents are in perpetual competition: the index spy (respectively, view spy) steals some space 
allotted to materialized views (respectively, indexes), and vice versa. The recovered space is 
used to create new indexes (respectively, materialized views) and prune views (respectively, 
indexes), according to predefined replacement policies. 

Another approach a priori determines a trade-off between storage space allotted to indexes 
and materialized views, depending on query definition (Rizzi & Saltarelli, 2003). According 
to the authors, the key factors to leverage query optimization is aggregation level, defined by 
the attribute list of Group by clauses in SQL queries, and the selectivity of attributes present 
in Where and Having clauses. View materialization indeed provides a great benefit for queries 
involving coarse granularity aggregations (few attributes in the Group by clause) because they 
produce few groups among a large number of tuples. On the other hand, indexes provide 
their best benefit with queries containing high selectivity attributes. Thus, queries with fine 
aggregations and high selectivity stimulate indexing, while queries with coarse aggregations 
and weak selectivity encourage view materialization. 

Finally, Bruno and Chaudhuri (2006) have recently worked on refining the physical design 
of relational databases. Their objective was to automatically improve an expert's physical 
design, to take into account primordial constraints it might violate. Hence, they proposed a 
transformation architecture base on two fusion and reduction primitives that helps process 
indexes and materialized views in a unified way. 

2.5 Discussion 

Existing studies related to index and materialized view selection are numerous and diverse in 
the field of databases, and quite developed in the field of data warehouses as well. However, 
we have identified two main points that could be improved in these approaches. 
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2.5.1 Candidate object selection 

Selecting candidate objects (materialized views and indexes) is rarely the focus of existing 
approaches, most of which do not scale up well at this level. Many index selection strategies 
indeed rest on human expertise (the warehouse administrator's) to propose an initial candidate 
index configuration. Given the size and complexity of most data warehouses, an automatic 
approach is mandatory to apply these methods on a real-life scale. The most recent studies 
actually take this option, by building the initial index configuration from system workload. 

With respect to materialized views, various data structures have been proposed (lattices, 
graphes, wavelets...) to model inter- view relationships. None of them scale up very well. For 
instance, browsing a candidate view lattice is very costly when the input data cube is very 
large. Similarly, building view graphs is as complex as the input workload is large. Hence, it 
is necessary to carefully evaluate a strategy's complexity before adopting it, and to optimize 
any data structure used. 

2.5.2 Inter-object interaction management 

None of the approaches we have presented in this section takes into account the interactions 
that may exist between indexes, between materialized views, and between index and views, 
including joint selection methods. Existing studies, especially those assimilating the selection 
problem to the knapsack problem or exploiting genetic algorithms, indeed compute the cost 
or benefit of an object (index or materialized view) once only, before injecting it in their 
algorithm. However, the relevance of selecting a given object may vary from one iteration to 
the other if another, previously selected object interacts with the first one. It is thus primordial 
to recompute costs or benefits dynamically before object selection. 

The nearest solution is the one by Bellatreche et al. (2000). However, its object replacement 
policies in the disk spaces allotted to indexes and materialized views do not truly reflect index- 
view interactions. They indeed only consider joint usage frequency in queries, and not the 
benefit an object brings with respect to other objects. 

3 Data mining-based warehouse performance opti- 
mization approach 

In this section, we first motivate our performance optimization approach. Then, we present its 
general principle, detail how candidate objects are selected and how a final object (materialized 
view and index) configuration is generated. 
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3.1 Motivation 

In this paper, our objective is to address the issues identified in Section 2.5. First, to ensure that 
candidate object (materialized view or index) selection scales up, it is necessary to devise an 
automatic approach. Generally, this is achieved by syntactically analyzing the system's query 
workload, which helps identify query attributes that might support indexes or materialized 
views. These attributes are then systematically combined to propose multi-attribute indexes 
or exhaustive view graphs. However, this strategy later leads, in the selection phase, to consider 
irrelevant objets, i.e., objects that do belong to the workload, but are not interesting in the 
scope of indexing or view materialization. 

To a priori eliminate these irrelevant objects, we propose to exploit data mining techniques 
to directly extract from the workload a configuration of pertinent candidate objects. Our idea 
is to discover co-occurencies and similarities between workload objects. For indexing, we 
base our approach on the intuition that the importance of an attribute to index is strongly 
correlated with its appearance frequency in the workload. For view materialization, devising 
similar classes of queries also helps build views that are likely to answer all the queries from a 
given class. 

On the basis of the smallest possible set of candidate, all relevant objects, we must then 
exploit an optimization algorithm (typically a greedy, knapsack or genetic algorithm) to build 
a quasi-optimal object configuration. However, to take index-view interactions into account, 
such algorithms must be modified. In a given iteration, an object's cost indeed depends on 
previously selected objects. Thus, it must be recomputed at each step. For simplicity reasons, 
we implemented this approach in a greedy algorithm. 

3.2 General principle of our approach 

Our automatic warehouse performance optimization approach (Figure 1) is not only based on 
information extracted from the warehouse's data (statistics such as attribute selectivity, for 
instance) or workload, but also on knowledge. This knowledge includes classical warehouse 
metadata (we notably exploit the database schema), as well as administration expertise, for- 
malized in cost models (benefit induced by an index and maintenance cost, for instance) or 
rules. Our approach proceeds in two main steps, which are both piloted by knowledge. 

The first step is building a candidate object configuration Oc- It consists in syntactically 
analyzing the input workload, which helps identify attributes that might be useful for view 
materialization or indexing. Applying rules issued from administration-related knowledge can 
already reduce the size of this attribute set. For instance, a low selectivity attribute such as 
gender, which has only two values, is not a good candidate index. This set of attributes is 
then structured as an attribute-value table that can be processed by a data mining algorithm. 
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Figure 1: Principle of our automatic performance optimization approach 

The output of such an algorithm is directly the candidate object configuration. 

Since disk space is constrained, it would be impossible to exploit all the candidate objects 
from Oc- Thus, the second step in our process is greedily selecting a final object configuration 
O from Oc- This algorithm exploits cost models we have developed to express, e.g., the benefit 
brought by a materialized view or an index, as well as their storage and maintenance costs 
(Section 4). Eventually, the last step in our approach consists in implementing the final object 
configuration in the data warehouse. 

Note that we have designed this approach in a modular fashion, so that it is as generic 
as possible. Completing the two main steps indeed brought us to perform choices, but other 
options would be easy to consider. For instance, the data mining technique we selected for 
building a candidate index configuration is frequent itemset mining, but another study explored 
clustering instead (Zaman, Surabattula, & Gruenwald, 2004). Besides, we have also used 
clustering for materialized view selection. Similarly, other optimization algorithms could be 
substituted to the greedy strategy we adopted to build the final object configuration. Our cost 
models might also be easily replaced with others if necessary, or by calls to a query optimizer, 
if it is accessible on the host DBMS. 

3.3 Candidate object selection 

System workload is typically accessible from the host DBMS' transaction log. A given work- 
load is supposed representative if it has been measured during a time period the warehouse 
administrator judges sufficient to anticipate upcoming transactions. 

Since we are more particularly interested in decision-support query performance and not 
warehouse maintenance, we only consider interrogation query workloads in this paper. These 
queries are typically composed of join operations between the fact table and dimensions, re- 
striction predicates and aggregation and grouping operations. More formally, an analytic query 
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q may be expressed as follows in relational algebra: q = Hg,m(o~r(F x Di ixi D 2 txs ... ix Dd)), 
where G is the set of attributes from dimensions Di that are present in q's grouping clause, 
M is a set of aggregate measures from fact table F and 7? a conjunction of predicates over 
dimension attributes. 

Attributes that may support materialized views or indexes belong to sets G and R (Chaud- 
huri & Narasayya, 1997; Golfarelli et al., 2002; Valentin et al., 2000; Feldman & Reouven, 
2003). We reference them in a so-called "query-attribute" binary matrix whose rows represent 
workload queries <?; and whose columns are representative attributes cij. The general term 
rriij of this matrix is equal to one if attribute aj is present in query qi, and to zero otherwise. 
A simple example of query-attribute matrix based on the workload excerpt from Figure 2 is 
featured in Table 1. 
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Figure 2: Workload excerpt 





Oi 


03 


04 


05 


a 7 


a 8 


ag 


aio 


9i 


1 


1 


1 

















92 


1 


1 





1 


1 


1 








93 


1 


1 














1 


1 



Table 1: Sample query-attribute matrix 



This data structure directly corresponds to attribute- value tables that are exploited by data 
mining algorithms. Here, attributes are queries and values, attributes. Applying a data mining 
technique onto the query-attribute matrix helps obtain a set of candidate objects (materialized 
views and indexes) O c . 

3.4 Final object configuration construction 

Our final materialized view and index configuration construction algorithm (Figure 3) is based 
on an ascending greedy search within the input candidate object set Oc- It starts from an 
empty final object configuration O, and then adds in it object o from Oc that maximizes 
objective function fo, at each iteration. For each object o G Oc, the value of fo(o) depends 
on objects already selected in O. Thus, it must be recomputed at each iteration, which helps 
take view-index interactions into account. The algorithm ends when objective function fo 
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cannot be improved any more, when there are no more candidate objects in Oc, or when 
storage space S allocated by the warehouse administrator to materialized views and index is 
full. 



O = 
so = 
Repeat 

Omax = 
fmax = 

For each o G Oc do 

If fo(o) > fmax then 

Omax = 
fmax = fo(o) 

End if 
End for 

If fo(o ma x) > then 
= OU{o max } 
so = so + size(o ma x) 

Oc = Oc \ {Omax} 

End if 

Until fo(o m ax) < or Oc = or so > S 



Figure 3: Final object configuration construction algorithm 



For a given workload Q and an object configuration O, the objective function fo may 
generally be expressed as follows: fo(°) = a o benefito(o) — /3 a maintenance(o) . Generally, 
benefito(o) = C03t ( < 3' )~^*W- 0u l°l) _ However, taking view-index interactions into account 
complicates this function's computation (Section 4.3.3). 

Cost models developed in Section 4 help compute the cost and maintenance functions. 
Coefficient a helps ponder benefit. It is generally equal to one, but may also help favor 
index that avoid join operations (Section 4.2). Finally, coefficient /3 = \Q\p(o) is an estimator 
for the number of updates of object o. The update probability of object o, p(o), is equal to 

1 ^refreshment where %refreshment represents the p ropor tion of Warehouse Updates with 

(^1 A: in lev rogation ' /otnter -rogation r r- r sr 

respect to interrogations. 



4 Applications 

This section presents three instances of our automatic data warehouse performance optimiza- 
tion approach: automatic materialized view selection, automatic index selection, and auto- 
matic, joint materialized view and index selection. We particularly detail this last, newest 
application. Moreover, we particularly insist, for each application, on its specificities in terms 
of candidate object selection (e.g., the data mining technique we exploited) and cost models 
used in building the final object configuration. 
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4.1 Clustering-based materialized view selection 

In this application, we propose to select materialized views by clustering queries from workload 
Q. Several syntactically similar queries have indeed a high probability of being resolved by 
one single materialized view. Then, we must build classes of similar queries from Q. Since the 
number of classes is a priori unknown, we have selected an unsupervised clustering method. 

Our approach's principle is similar to SQL workload compression (Chaudhuri, Gupta, & 
Narasayya, 2002), a technique proposed in the relational database context to optimize, for 
instance, index selection or approximate answer to aggregation queries. We adapted this 
idea to the context of relational (with an SQL decision-support workload) and XML (with 
an XQuery decision-support workload) data warehouses (Aouiche, Jouve, & Darmont, 2006; 
Mahboubi, Aouiche, & Darmont, 2006). 

The main improvement brought by our approach lies at the candidate view selection level. 
Most anterior methods indeed build a lattice or graph of all syntactically correct views for a 
given workload. However, in practice, such data structures are complex to build and browse. 
Using a clustering algorithm helps drastically reduce the number of candidate materialized 
views by proposing only a couple of views per class (only one in the best case — Section 4.1.1) 
instead of one view per workload query. This dimensionality reduction helps improve the whole 
process' efficiency and offers true scaling up capability. 

4.1.1 Candidate materialized view selection 

Query similarity and dissimilarity. To perform clustering and check out whether 
query classes are homogeneous, we must define query similarity and dissimilarity measures. 
Let M be a query-attribute matrix of general term rriij, defined on query set Q — {qi, i = l..n} 
and attribute set A = {a,j,j = 1../}. We define the elementary similarly and dissimilarity 
between two queries qi and q^ , regarding attribute cij, as follows. 



Note that these definitions are not symmetric. The absence of a given attribute in two 
queries does indeed not constitute an element of similarity, unlike its presence. We now extend 
these definitions onto attribute set A to obtain global similarity and dissimilarity between 
queries qi and qy. 



Query clustering. The objective of clustering is to build a natural partition of queries 
that reflects their internal structure. Objects in the same class must be strongly similar, 
while objects from different classes must be strongly dissimilar. Let P — {Ck, k = l..p} be a 
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partition of p classes (query sets). We define interclass similarity between two distinct classes 

C a and Cb from P, as well as intraclass dissimilarity within class C a from P, as follows. 

Sim(C a ,Cb) = sim(qi,qi>) Dissim(C a ) — dissim(qi,qii) 

qiec a , qi ,ec b , qieCa,qiieCa,i<i' 
Eventually, we define on P a measure of clustering quality Q(P) that helps capture the 

partition's natural aspect. Q(P) indeed possesses low values for partitions that have a strong 

intraclass homogeneity and a strong interclass disparity. Q(P) must be minimized. 

2 

Q(P)= Sim{C a ,C h ) + ^Dissim(C a ) 

a — 1. .p,b= 1 . .p,a<_b a — 1 

To actually perform clustering, we selected the Kerouac algorithm (Jouve & Nicoloyannis, 
2003) that bears interesting features in our context. It can indeed easily take our quality 
measure Q(P) into account, as well as integrate constraints in the clustering process. It 
is thus possible to satisfy a precondition in the materialized view fusion process (see next 
paragraph): queries from one given class must share the same joining conditions. 

Candidate view fusion. The output of clustering is a set of similar query classes. Our 
objective is to associate to each class the smallest possible number of materialized views that 
cover all the class' queries. To achieve this goal, we consider each query as a potential view 
and run a fusion process to decrease their number. The algorithm we use is very similar to 
the one proposed by Agrawal et al. (2000). However, in our context, it is much more efficient 
since it is applied onto a limited number of views in each class instead of the whole set of 
candidate views derived from the workload. The output of fusion applied on classes obtained 
in the previous step is the set of candidate materialized views. 

4.1.2 Cost models 

In most of the (relational) data warehouse cost models from the literature, the cost of a query 
q is supposed proportional to the size (in tuples) of the materialized view exploited by q (Gol- 
farelli & Rizzi, 1998). The same assumption is made for view maintenance cost. Hence, we 
reuse a model that estimates the size of a given materialized view. It has been proposed by Gol- 
farelli and Rizzi (1998) and exploits Yao (1977)'s formula to estimate the number of tuples \V\ 
of a view V composed of k attributes ai, 02, a*; and based on fact table F and d dimensions 

Ih.Ih !>r. \V\ = nun: sizc(V) x (l \\ " — ^ 



\ i=l 
k d 



|F| rnax_size(F) x (1 - max J ize(v) + l 
maxsize(F) — i + 1 



where rnax_size(V) — J^J a;| and rnax_size(F) — J^J \ Di\- 

i=l i=l 

When ratio ' s ^'S n enough, Cardenas (1975)'s formula helps obtain a good 

approximation: \V\ = max_size(V) x ^1 — ^1 — max ^ ize ^ v ^ ^ ^ • 

n 

Vs size in bytes is then size(V) = \V\ x ^""^ size(dj), where size(di) is the size in bytes 

i=l 

of dimension di from V (which can be directly obtained from the warehouse metadata) and 
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n the number of dimensions in V. Yao and Cardenas' formulae assume data are uniformly 
distributed and tend to overestimate view size. However, they are easy to implement and fast 
to compute. Other, more precise methods exploit data sampling and statistical laws (Shukla, 
Deshpande, Naughton, & Ramasamy, 1996; Chaudhuri &: Motwani, 1999; Nadcau & Teorey, 
2001), but they are much harder to implement. 

Eventually, this cost model is very easy to adapt to the XML context by establishing equiv- 
alences between relations and XML documents on one hand, and tuples and XML elements on 
the other hand. The only true difference lies in size(di)'s computation, but it is also obtained 
from warehouse metadata in the XML context. 

4.2 Frequent itemset mining-based index selection 

In this application, we work on optimizing the execution of join operations in a decision-support 
query workload. We propose an index selection based on the extraction from the workload of 
frequent attributes that may support indexes. 

We have first worked on classical, B-tree-like indexes (Aouiche, Darmont, & Gruenwald, 
2003). We focus in this paper on bitmap index selection (Aouiche, Darmont, Boussa'id, & 
Bentayeb, 2005). These data structures (O'Neil & Graefe, 1995) are particularly adapted to 
the data warehouse context. They indeed render logical and counting operations efficient (they 
operate directly on bitmaps stored in the main memory), and help precompute join operations 
at index creation time. Moreover, bitmap storage space is small, especially when the indexed 
attributes' cardinality is low, which is usually the case in a warehouse's dimensions. 

Our approach's originality mainly lies in the use of frequent itemset mining for selecting 
the most pertinent candidate indexes. However, it also has another advantage. The few 
approaches that help select multi-attribute indexes exploit an iterative process to build them: 
mono-attribute indexes in the first iteration, 2-attribute indexes in the second, and so on 
(Chaudhuri & Narasayya, 1997). In our approach, frequent itemsets, which are attribute sets 
of variable size, help directly propose multi-attribute candidate indexes. Furthermore, these 
candidate indexes are a priori pertinent, while combinations generated from smaller candidate 
indexes are not necessarily all pertinent. Thus, our approach avoids pruning them by providing 
a smaller set of pertinent candidates. 

Eventually, most existing index selection techniques (Section 2.3) only exploit B-tree in- 
dexes. Though this type of index is widely used in DBMSs, it is not the best adapted to index 
voluminous data and low cardinality attributes. In the data warehouse context, bitmap join 
indexes we privilege are more efficient. 
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4.2.1 Candidate index selection 

When building the extraction context (query-attribute matrix) that is exploited by a data 
mining algorithm to select candidate indexes, we use knowledge relative to database admin- 
istration and performance optimization, much like Feldman and Reouven (2003). Such an 
attribute preselection helps reduce the mining algorithm search space and, mechanically, im- 
proves its response time. 

Knowledge is formalized under the form of "if-then" rules, e.g., "if a predicate is like 
attribute =fc value, then attribute must not be selected". Such a predicate would indeed not 
exploit an index defined on attribute, all its values being scanned but value. 

We base the final selection of candidate indexes on the intuition that the importance 
of an attribute to index is strongly correlated to its appearance frequency in the workload. 
Frequent itemset mining (Agrawal & Srikant, 1994) appears as a natural solution to extract 
these attributes. Many frequent itemset mining algorithms are available in the literature. We 
selected Close (Pasquier, Bastide, Taouil, & Lakhal, 1999), which presents several advantages 
in our context. 

First, Close helps process voluminous workloads, ft indeed exploits Galois closure opera- 
tors, which reduce the number of accesses to the extraction context when searching for frequent 
itemsets. Close is also efficient when the extraction context is dense, which is our case, since 
query sets often form logical suites. Eventually, closed frequent itemsets extracted by this 
algorithm are fewer than all frequent itemsets (which can nonetheless be generated from the 
closed frequent itemsets). This helps reduce computing time and avoid multiplying useless 
candidate indexes. 

4.2.2 Cost models 

Data access cost through a bitmap join index. Data access is performed in two 
steps: scanning the index' bitmaps, and then reading the tuples. If access to the bitmaps 
is direct and data are uniformly distributed, which is a reasonable assumption according to 
Choenni et al. (1993a), index traversal cost is d ^gj . d is the number of predicates applied 
on indexed attribute A. F is the fact table. S p is the size of a disk page. ^ A g F represents the 
size of the bitmap index (Wu & Buchmann, 1998) . 

The number of tuples read by a query using d bitmaps is dj^j if data are uniformly dis- 
tributed. The number of corresponding input/output is then equal to Pf(1 — e "f\ a \ ) (O'Neil 
& Quass, 1997), where pf is the number of disk pages that are necessary to store F. Finally, 
cost = +Pf{1 - e~^W). 

If bitmap access is performed through a B-tree, as is the case in the Oracle DBMS, for 
instance, B-tree descent cost must be taken into account: Zogml^l — 1, where m is the B-tree 
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order. Leaf nodes traversal cost is then at worst. However, bitmap index traversal cost 

d\F\ 

is reduced to df£. Then, cost = log m \A\ - 1 + df^ + Pf{1 - e pfW). 

Bitmap join index maintenance cost. Let a bitmap join index be defined on attribute 
A from dimension D. When inserting a tuple into fact table F, D must be traversed to find 
the tuple that must be joined to the one inserted in F: po pages are read. Then, the index' 
bitmaps must be updated. At worst, they are all traversed and ^ F pages are read. Hence, 
maintenanceF = Pd + ■ 

When inserting a tuple into dimension D, update may be without domain expansion, then 
a bit corresponding to the inserted tuple must be added to each bitmap; or with domain 
expansion, then a new bitmap must be created. Then, maintenanceo = Pf + (1 + £)^4sjF> 
where £ = 1 when expanding the domain and £ = otherwise. 

4.3 Joint materialized view and index selection 

In this eventual application, wc seek to select a configuration of materialized views and indexes 
that are mutually beneficial, in order to further optimize the response time of decision-support 
queries. More precisely, we aim at truly taking view-index interactions into account and at 
optimizing storage space sharing between materialized views and indexes. Existing approaches 
indeed consider indexes and materialized views as distinct objects, whose benefit and mainte- 
nance cost are invariant and independent from already-selected objects. Moreover, few consider 
indexing materialized views. 

Bellatreche et al. (2000) 's approach, which is closest to ours, starts from an initial solution 
composed of indexes and materialized views isolately selected under storage space constraint. 
Taking this constraint into account a priori might eliminate solutions that are susceptible to 
become pertinent in the next iterations of the selection process. Hence, we only introduce 
the storage space constraint a posteriori, within the selection algorithm. Furthermore, object 
replacement policies in storage spaces respectively allotted to indexes and materialized views 
exploit these objects' usage frequency, and not the benefit brought by their simultaneous usage. 

4.3.1 Candidate object selection 

First, let us detail and specialize the automatic performance optimization strategy presented 
in Section 3.2 for joint materialized view and index selection. Here, we exploit the modular 
structure of our approach: our input is a set of candidate objects (materialized views and 
indexes) obtained with any existing selection algorithm, such as the ones we propose. Then, 
we exploit specific data structures and cost models to recommend a pertinent configuration of 
materialized views and indexes through the following steps (Figure 4): 
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1. extract a representative query set Q from system workload; 

2. build a set of candidate materialized views Vc using the approach described in Sec- 
tion 4.1.1, with Q as input; 

3. build a set of candidate indexes Ic using the approach described in Section 4.2.1, with 
Q U Vc as input; 

4. simultaneously select materialized views and indexes from Oc = Vc U Io\ 

5. build the final configuration of materialized views and indexes O C Oc under storage 
space constraint S. 
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Figure 4: Joint materialized view and index selection approach 



4.3.2 Specific data structures 

After building the set of candidate materialized views, indexes and indexes on views Oc, we 
aim at combining them to recommend a pertinent configuration of materialized views and 
indexes O. To consider the relationships between these objects in this process, we need to 



17 



materialize them. For this purpose, we use three binary matrices: the "query-view" matrix, the 
"query-index" matrix and the "view-index" matrix that we detail in the following paragraphs. 

To better illustrate how these data structures are designed, let us consider the workload 
sample from Figure 5. Candidate materialized views and indexes obtained from this workload 
by applying our strategy are featured in Figures 6 and 7, respectively. 
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select .sales. timc_id, sum( amount _sold) 


95 


select promotions . promo_name, 




from sales, times 


sum ( amount _sold) 




where sales .time_id — times. timc_id 




from sales, promotions 




and times. time_fiscal_year = 2000 




where sales . promoid = promotions. promo_id 




group by sales. time_id 




and promotions.promo_begin_date='30/01/2000' 








and promot ions. promo_cnd_datc= '30/03/2000' 








group by promotions. promo_namc 


92 


select sales. prod_id, 


96 


select customers.cust_marital_status, 


sum(amount_sold) 


sum(quantity_sold) 




from sales, products, promotions 




from sales, customers, products 




where sales . prod_id = products. prod_id 




where sales .cust_id = customers. cust_id 




and sales. promo_id = promotions . promo_id 




and sales. prod_id = products. prod_id 




and promotions. promo_category = 'news paper' 




and customers. cust_gcndcr = 'woman' 




group by sales. prod_id 




and products. prod_namc = 'shampooing' 








group by customers. cust_first_namc 


13 


select customers. cust_gcndcr, sum(amount_sold) 


97 


select products. prod.namc, sum(amount_sold) 


from sales, customers, products, 


from sales, products, promotions 




where sales. cust_id = customers . cust_id 




where sales . prod_id = products. prod_id 




and sales. prod_id = products . prod_id 




and sales. promo_id =promotions.promo_id 




and customers. cust_marital_status ='singlc' 




and products. prod_catcgory = 'tec shirt' 




and products. prod.catcgory = 'women' 




and promotions. promo_end_date= '30/04/2000' 




group by customers. cust_gender 




group by products. prod.namc 


94 


select products. prod_namc, sum ( amount _sold) 


98 


select channels. channcl_dcsc . sum ( quant ity_sold) 




from sales, products, promotions 




from sales, channels 




where sales. prod_id = products. prod_id 




where sales . channel-id = channels. channel-id 




and sales. promo_id = promotions . prom.id 




and channels. channel-class = 'Internet' 




and promotions . promo.catcgory — 'TV' 




group by channels. channcl.desc 




group by products. prod_namc 







Figure 5: Sample workload 



Query-view matrix. The query-view matrix (QV) captures existing relationships be- 
tween workload queries and the materialized views extracted from these queries, i.e., views 
that are exploited by at least one workload query. This matrix may be viewed as the result 
of rewriting queries with respect to candidate materialized views. The query-view matrix' 
rows and columns are workload queries and candidate materialized views, respectively. Its 
general term QV qv is equal to one if a given query q exploits the corresponding view v, and 
to zero otherwise. Table 2 presents the query-view matrix corresponding to the example from 
Figures 5 and 6. 
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Table 2: Sample query- view matrix 



Query-index matrix. The query-index matrix (QI) stores the indexes built on base ta- 
bles. This matrix may be viewed as the result of rewriting queries with respect to candidate in- 
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Vi create materialized view vi as 

select sales. time_id, times. time_fiscal_year, 
sum ( amount _sold) 
from sales, times 

where sales. timc_id = times .time.id 

group by sales. timc_id, times. timcs_fiscal_year 

v-2 create materialized view v<2 as 

select sales. prod_id, sales. cust_id, channels. channel_desc, 

channels. channel_class, sum ( quant ity_sold) 

from sales, channels, products, customers 

where sales. prod_id = products. prod_id 

and sales. channeled — channels. channcljd 

and sales. cust_id = customers. cust_id 

group by sales. prod_id, sales. cust_id, channels. channel_desc, 
channels. channcl_class 

tjg create materialized view as 

select customer s.cust_first_namc, products.prod_name, 

pro ducts, pro d_category, customers . cust -gender, 

customcrs.cust_marital_status, sum (sales, quant ity_sold) 

from sales, customers, products 

where sales. cust_id ~ customers. cust_id 

and sales . prod_id = products. prod_id 

group by customcrs.cust_first.namc, products. prod_name, 
pro ducts, pro d.category, customers . cust.gcndcr, 
customers . cust_marital_status 

?.'4 create materialized view ^4 as 

select products. prod_name, products. prod_catcgory, 

promotions.promo_category, sum (amount _sold) 

from sales, products, promotions 

where sales. prod_id = products. prod_id 

and sales . pro mo.id = promotions.promo.id 

group by products.prod_name, products .prod .category, 

promotions.promo.catcgory 

1)5 create materialized view as 

select sales. prod _id, products, prod .category, 
promotions.promo.catcgory, sum( amount _sold) 
from sales, products, promotions 
where sales. prod_id = = products. prod_id 
and sales. promo.id = promotions. promo-id 

vq create materialized view vq as 

select channels . channel _class , product s.prod.namc, channels. channel _dcsc, 

products . prod.category, sum (sales, quant it y_sold) , sum (sales, amount _sold) 

from sales, channels, products 

where sales. prod_id = products. prod_id 

and sales. channeled = channels. channeled 

group by channels. channcLclass, products. prod_name, 

pro ducts. prod_category, channels. channel_desc 

Vf create materialized view Vf as 

select sales. prod_id, products, prod -category, 
channels. channel_desc, promotions.promo.namc, 
promotions.promo_bcgin_date, promotions. promo.end.date, 
products.prod.namc, sum (sales, quant ity.sold) , sum (sales, amount _sold J 
from sales, products, promotions 
where sales. prod.id = products. prod.id 
and sales . promoid = promotions. promodd 
and sales . channel_id = channels. channcLid 

group by sales. prod_id, products. prod_catcgory, channels. channcLdcsc, 
promotions.promo_namc, promotions. promo _bcgin_date, 
promotions.promo.cnd.datc, products. prod.name 



Figure 6: Candidate materialized views 



dexes. The query-index matrix' rows and columns are workload queries and candidate indexes, 
respectively. Its general term QI q i is equal to one if a given query q exploits the corresponding 
index i, and to zero otherwise. Table 3 presents the query-index matrix corresponding to the 
example from Figures 5 and 7. 



View-index matrix. The view-index matrix (VI) identifies candidate indexes that are 
recommended for candidate materialized views returned by our view selection algorithm. The 
query-index matrix' rows and columns are candidate views and candidate indexes on these 
views, respectively. Its general term VI V i is equal to one if a given materialized view v 
exploits the corresponding index i, and to zero otherwise. Table 4 presents the view-index 
matrix corresponding to the example from Figures 6 and 7. 
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Figure 7: Candidate indexes 
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Table 3: Sample query-index matrix 



4.3.3 Cost models 

We have already presented in Sections 4.1.2 and 4.2.2 cost models relative to materialized 
views and bitmap join indexes, respectively. Since indexes defined on materialized views are 
generally B-trees or derivatives, we first recall here the cost models that relate to these indexes. 
Then, we discuss the benefit of view materialization vs. indexing. 

Data access cost through a B-tree index. Data access through an index is subdivided 
into to steps: index traversal to find key values corresponding to the query (Ctraversai cost), 
and then searching for these identifiers in the database (Csearch cost). Let q be a query, £ a set 
of indexes, SNA q the set of attributes that are present in query q's restriction clause (the Where 
clause in SQL), BF a the bloc factor of the index built on attribute a (the average number of 
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Table 4: Sample view-index matrix 
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1. 



(key, identifier) couples per disk page), SF a the selectivity factor of attribute a, and finally 
v the accessed materialized view. Then: C traversal = \l°gBF a \v\] + ^^i^ 

ae«nSNA g ) 

The number of identifiers to search for is then N = \v\ | | SF a . According, to Car- 

ae{CnSNA q ) 

denas (1975)'s formula, the number of disk pages to access is: C sear ch — S p ^1 — (1 — ~i~) N ^J > 
where S p represents disk page size. 

Finally, data access cost through a B-tree index is cost — C traversal + C 'search- 

B-tree index maintenance cost. Classically, this cost is expressed as follows: 
maintenance = C op (a); where f ins , fdei and f upd are insert, delete 

and update frequencies, respectively; and Cj„ s (o), Cdei(a) and C up d(a) are maintenance costs 

related to an insert, delete or update operation on attribute a, respectively. A op is the set of 

considered attributes. Ai n3 — Adei = C where £ is the set of indexes to maintain. 

A up d — C l~l SNA upd , where SNA upd is the set of attributes to update. Finally, maintenance 

costs are the following (Whang, 1985): Ci„ s (a) = Cdei(a) — \logBF a \v\] and 

C up d(a) = \log B F a \v\\ + [i^f] - 1- 

View materialization and indexing benefit. In the general case (Section 3.4), the 
benefit brought by selecting an object o is defined as the difference between the execution cost 
of query workload Q before and after inserting o into the final object configuration O. Taking 
view-index relationships into account implies redefining the benefit function. Let i £ Oc and 
v G Oc be a candidate index and a candidate materialized view, respectively. Adding i or v 
into O may lead to the benefit cases enumerated in Tables 5 and 6, respectively, depending on 
interactions between i and v. 
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Table 5: Benefit brought by index i 
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Table 6: Benefit brought by materialized view v 

Indexing and view materialization benefits for Q, brought by adding index i or view v into 
O, respectively, may hence be expressed as follows. 

cost(Q,0)-cost(Q,OU{i}) if yj vi ^OVWGV^CO) 



benefito(i) 



COSt(Q,0)-COSt(Q,OU{i}UV>) if y, = r g y^ yj^ = j } + , 

size(i)-\-y size(v') 



otherwise 
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if VI V i = V i G I (I C O) 



benefito(v) = < 




otherwise 



5 Experiments 



In order to experimentally validate our generic approach for optimizing data warehouse per- 
formance, we have run several series of tests. We summarize the main results in the following 
sections. Regarding isolate materialized view or index selection, the interested reader can refer 
to (Aouiche et al., 2003, 2005, 2006; Mahboubi et al., 2006) for more complete results. 

5.1 Experimental conditions 

All our tests have been run on a 1 GB data warehouse implemented within Oracle 9i, on a 
Pentium 2.4 GHz PC with 512 MB RAM and a 120 GB IDE disk. Our test data warehouse is 
actually derived from Oracle's sample data warehouse. Its star schema is composed of one fact 
table: Sales; and five dimensions: Customers, Products, Times, Promotions and Channels. 
The workload we executed on this data warehouse is composed of 61 decision-support queries 
involving aggregation operations and multiple joins between the fact table and dimension 
tables. Due to space constraints, we do not reproduce here the full data warehouse schema 
nor the detail of each workload query, but they are both available on-line 1 . 

Note that our experiments are based on an ad-hoc benchmark because, at the time we 
performed them, there was no standard benchmark for data warehouses. TPC-H (TPC, 2005) 
does indeed not feature a true multidimensional schema and thus does not qualify, and TPC- 
DS' (TPC, 2007) draft specifications had not been issued yet. 

5.2 Materialized view selection results 

We plotted in Figure 8 the variation of workload execution time with respect to the storage 
space allotted for materialized views. This figure shows that the views we select significantly 
improve query execution time. Moreover, execution time decreases when storage space occu- 
pation increases. This is predictable because we create more materialized views when storage 
space S is large and thereby better improve execution time. Let Sv be the disk space that is 
necessary to store all the candidate materialized views. The average gain in performance is 
indeed 68.9% when S = 35.4% x S V - It is equal to 94.9% when S = 100% x Sv (when the 
storage space constraint is relaxed). 

Moreover, we have demonstrated the relevance of the materialized views that are selected 
with our approach by computing query cover rate, i.e., the proportion of queries resolved by 
1 http : / /eric . univ-lyon2 . f r/~kaouiche/adbis . pdf 
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Figure 8: Views materialization experiment results 

using views. When the storage space constraint is hard (S — 0.05% x Sv), average cover rate 
is already 23%. It reaches 100% when the storage space constraint is relaxed. 

5.3 Index selection results 

In these experiments, we have fixed the minimal support parameterized in the Close frequent 
itemset mining algorithm to 1%. This value gives the highest number of frequent itemsets and 
consequently the highest number of candidate indexes. This helps vary storage space S within a 
wide interval. We have measured query execution time according to the percentage of storage 
space allotted for indexes. This percentage is computed from the space Si occupied by all 
indexes. Figure 9 shows that execution time decreases when storage space occupation increases. 
This is predictable because we create more indexes and thus better improve execution time. 
We also observe that the maximal time gain is about 30% and it is reached for space occupation 
S = 59.64% x Si. 

Finally, these experiments also showed that our index selection strategy helped select a 
portion of candidate indexes that allows to achieve roughly the same performances than the 
whole set of candidate indexes. This guarantees substantial gains in storage space (40% on an 
average) and decreases index maintenance cost. 

5.4 Joint index and materialized view selection results 

Eventually, we have compared the efficiency of isolate materialized view selection, isolate 
index selection and joint materialized view and index selection. We have measured query 
execution time in the following cases: without materialized views nor indexes (reference plot), 
with materialized views only, with indexes only and with both materialized views and indexes 
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Figure 9: Indexing experiment results 

(simultaneously selected). Figure 10 represents the variation of response time with respect to 
the storage space S allotted to materialized views and indexes. S is expressed in percentage of 
total space Svi occupied by all indexes and materialized views, achieved when we apply our 
strategy without any storage space constraint. Note that we used a logarithmic scale on the 
X axis to better visualize the results. 
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Figure 10: Joint view materialization and indexing experiment results 



Figure 10 shows that jointly selecting materialized views and indexes allows better perfor- 
mance than selecting indexes or views separately when storage space is large. However, when 
it is small, isolate index selection is more competitive than the other solutions. This may be 
explained by the fact that index size is generally significantly smaller than materialized view 
size. Then, we can store many more indexes than materialized views in a small space and 
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achieve a better performance. In conclusion, indexes should thus be privileged when storage 
space is strongly constrained. 

6 Conclusion and perspectives 

We have presented in this paper an approach for automatic data warehouse performance opti- 
mization. Our main contribution in this field mainly relates to exploiting knowledge about the 
data warehouse and its usage. Knowledge may either be formalized expertise, or automatically 
extracted with the help of data mining techniques. This approach allowed us to reduce the 
dimensionality of the materialized view and index selection problem, by proposing a reduced 
and pertinent candidate object configuration. We also have explicitly taken view-index inter- 
actions into account, to propose a final object configuration that is as close as possible to the 
optimum. 

We have designed our approach to be generic. Data mining techniques and cost models we 
exploit are indeed not related to any system in particular. They may be applied on any host 
DBMS. Our materialized view and index strategies are also modular: each step (candidate 
object selection, cost computation...) exploits interchangeable tools. The data mining tech- 
niques and cost models we used could easily be replaced by others. Moreover, we could also 
extend our approach to other performance optimization techniques, such as buffering, physi- 
cal clustering or partitioning (Agrawal, Chaudhuri, Kollar, Marathe, Narasayya, & Syamala, 
200; Zilio, Rao, Lightstone, Lohman, Storm, Garcia-Arellano, & Fadden, 2004; Bellatreche, 
Boukhalfa, & Mohania, 2005). 

Though we have systematically tried to demonstrate the efficiency of our proposals by 
experimenting on real-life systems such as Oracle, up to now, we have not been able to compare 
our proposals to existing approaches in situ. Those that are developed by DMBS vendors 
(Agrawal et al., 200; Dageville, Das, Dias, Yagoub, Zai't, & Ziauddin, 2004; Zilio et al., 2004) 
necessitate the acquisition of the corresponding system. Furthermore, they are implemented 
as "black boxes" that are often hard to tinker with. Finally, research proposals from the 
literature are not always available as source or executable code and, when it is the case, they 
operate in one given environment and must often be reimplemented. We shall have to get over 
these difficulties to complete our solutions' experimental validation, though. 

Finally, the main possible evolution for our work resides in improving our solutions' au- 
tomaticity. We indeed perform static performance optimization. If the input query workload 
significantly evolves, we must rerun the whole process to preserve performance. Dynamic 
materialized view selection approaches that have been proposed to optimize refreshing times 
(Kotidis & Roussopoulos, 1999; Shah, Ramachandran, & Raghavan, 2006) are more efficient 
than static approaches. We must work in this direction for optimizing query response time. 
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Our main lead is to exploit our approach's modularity by replacing the data mining tech- 
niques we used by incremental frequent itemset mining (Leung, Khan, Sz Hoque, 2005) or 
clustering (Jain, Murty, & Flynn, 1999) techniques. Studies related to session detection that 
are based on entropy computation (Yao, Huang, & An, 2005) could also be exploited to detect 
when to rerun the (incremental) selection of materialized views and indexes. 
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